A system and method of hindering an undesirable transmission or receipt of electronic messages within a network of users includes the steps of determining that transmission or receipt of at least one specific electronic message is undesirable; automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and taking appropriate action, responsive to the scanning step.
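
The four steps named in the abstract (determine, extract, scan, act) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the class and function names are hypothetical, and the "detection data" is simply a SHA-256 digest of the whitespace- and case-normalized text, a stand-in that catches only trivial variants and is not the signature scheme of the invention.

```python
import hashlib
import re


def normalize(body: str) -> str:
    """Lower-case the text and collapse whitespace so trivial variants match."""
    return re.sub(r"\s+", " ", body).strip().lower()


def extract_detection_data(body: str) -> str:
    """Stand-in detection data (an assumption): SHA-256 of the normalized body."""
    return hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()


class DetectionStore:
    """Hypothetical store of detection data for messages flagged by users."""

    def __init__(self) -> None:
        self._known: set[str] = set()

    def flag_undesirable(self, body: str) -> None:
        # Steps 1 and 2: a user determines the message is undesirable,
        # and detection data is extracted from it automatically.
        self._known.add(extract_detection_data(body))

    def scan(self, bodies: list[str]) -> list[bool]:
        # Step 3: check each inbound or outbound message against the store.
        return [extract_detection_data(b) in self._known for b in bodies]


def take_action(hits: list[bool]) -> list[str]:
    # Step 4: the action chosen here is simply to label each message.
    return ["UNDESIRABLE" if hit else "OK" for hit in hits]


store = DetectionStore()
store.flag_undesirable("MAKE MONEY FAST!!!\nReply now.")
inbox = [
    "make money   fast!!! reply now.",
    "During my meeting with you last Tuesday",
]
print(take_action(store.scan(inbox)))  # prints ['UNDESIRABLE', 'OK']
```

A whole-message digest is the simplest possible choice and fails as soon as a single word changes; it is used here only to make the flow of the four steps concrete.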
SYSTEM AND METHOD FOR HINDERING
UNDESIRED TRANSMISSION OR RECEIPT
OF ELECTRONIC MESSAGES

FIELD OF THE INVENTION

The present invention relates generally to digital data processors and networks of intercommunicating digital data processors capable of sending and receiving electronic mail and other types of electronic messages. In particular, the present invention relates to a system and method for automatically detecting and handling unsolicited and undesired electronic mail such as Unsolicited Commercial E-mail (UCE), also referred to as "spam."

BACKGROUND OF THE INVENTION

Every day, millions of Internet users receive unwelcome electronic messages, typically in the form of electronic mail (e-mail). The most familiar example of these messages is Unsolicited Commercial E-mail (UCE), commonly referred to as "spam." UCE typically promotes a particular good, service or web site, and is sent indiscriminately to thousands, or even millions, of people, the vast majority of whom find the UCE annoying or even offensive. UCE is widely perceived as a significant problem. Articles concerning UCE appear on an almost daily basis on technology news services, such as CNET. Several commercial and shareware products have been written to reduce e-mail users' exposure to UCE. At least one start-up company, Bright Light Technologies, has been founded for the sole purpose of producing and selling technology to detect and filter out UCE. Legal restrictions are being contemplated by several states, and actually have recently been put in place in more than one state.

Other forms of undesired e-mail include rumors, hoaxes and chain letters. Each of these forms of e-mail can proliferate within a network of users very quickly. Rumors can spread with much vigor throughout a user population and can result in wasted time and needless concern. The most successful computer virus hoaxes have a longevity comparable to that of computer viruses themselves, and can cause a good deal of panic. Finally, circulation of chain letters is a phenomenon that is serious enough to be forbidden by company policies or even federal laws.

A somewhat different class of e-mail, the transmission or receipt of which is often undesirable, is confidential e-mail. Confidential e-mail is not supposed to be forwarded to anyone outside of some chosen group. Therefore, there is a concern for controlling the distribution of these messages.

A common characteristic of UCE and electronically-borne rumors, hoaxes, and chain letters is that there is likely to be wide-spread agreement that the content of the message in question (and, thus, transmission thereof) is undesirable (as opposed to merely uninteresting). This, along with the fact that such messages are in electronic form, makes it possible to contemplate various technologies that attempt to automatically detect and render harmless this e-mail.

To date, UCE has been the exclusive focus of such efforts. Existing UCE solutions take a number of different forms. Some are software packages designed to work with existing e-mail packages (e.g., MailJail, which is designed to work with the Eudora mail system) or e-mail protocols (e.g., Spam Exterminator, which works for any e-mail package that supports the POP3 protocol on the Windows 95, Windows 98 or Windows NT platforms). Other solutions are integrated into widely used mail protocols (e.g., SendMail v. 8.8, a recent upgrade of the SendMail mail transfer protocol, which provides a facility for blocking mail relay from specified sites, or alternatively from any site other than those explicitly allowed). Another type of solution is an e-mail filtering service, e.g., the one offered by junkproof.com, which fines users who send UCE. Bright Light Technologies proposes to combine a product with a service.

However they may be packaged, the vast majority of these solutions are composed of two main steps: recognition and response. In the recognition step, a given e-mail message is examined to determine whether it is likely to be spam. If the message is deemed likely to be spam during the recognition step, then some response is made. Typical responses include automatically deleting the message, labeling it or flagging it to draw the user's attention to the fact that it may be spam, placing it in a lower priority mail folder, etc., perhaps coupled with sending a customizable message back to the sender.

The main technical challenges lie in the recognition step. Two of the most important challenges include keeping the rates of false positives (falsely accusing legitimate mail as spam) and false negatives (failing to identify spam as such) as low as possible. A wide variety of commercial and freeware applications employ combinations and/or variations on the following basic spam detection strategies to address the general problem.

Domain-based Detection

Often, persons who send spam ("spammers") set up special Internet address domains from which they send spam. One common anti-spam solution is to maintain a blacklist of "spam" domains, and to reject, not deliver or return to the sender any mail originating from one of these domains. When spam begins to issue from a new "spam" domain, that domain can be added to the blacklist.

For example, xmission.com has modified sendmail.cf rules to cause mail from named sites to be returned to the sender. Their text file (http://spam.abuse.net/spam/tools/dropbad.txt) lists several domains that are known to be set up solely for use by spammers, including moneyworld.com, cyberpromo.com, bulk-e-mail.com, bigprofits.com, etc. At http://www.webeasy.com:8080/spam/spam_download_table, one can find just over 1000 such blacklisted sites. Recent versions of SendMail (versions 8.8 and above) have been modified to facilitate the use of such lists, and this has been regarded as an important development in the battle against spam.

However, if used indiscriminately, this approach can lead to high rates of false positives and false negatives. For instance, if a spammer were to send spam from the aol.com domain, aol.com could be added to the blacklist. As a result, millions of people who legitimately send mail from this domain would have their mail blocked. In other words, the false positive rate would be unacceptably high. On the other hand, spammers can switch nimbly from a banned domain to a non-banned, newly-created one, or one that is used by many legitimate users, thus leading to many false negatives.

Header-based Detection

A hallmark of spam is that it is sent to an extremely large number of recipients. There are often indications of this in the header of the mail message that can be taken as evidence that a message is likely to be spam. For example, the long list of recipients is typically dealt with by sending to a smaller set of collective names, so that the user's explicit e-mail address does not appear in the To: field.
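
As a hedged illustration of such a header rule, the following Python sketch uses the standard library's `email` package to test whether the user's own address appears explicitly in the To: or Cc: fields. The function names, the folder names and the choice to inspect Cc: as well as To: are assumptions made for this example; they are not taken from any of the products discussed here.

```python
from email import message_from_string
from email.utils import getaddresses


def addressed_explicitly(raw_message: str, user_address: str) -> bool:
    """Return True if user_address is listed in the To: or Cc: header."""
    msg = message_from_string(raw_message)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    listed = {addr.lower() for _name, addr in getaddresses(fields)}
    return user_address.lower() in listed


def choose_folder(raw_message: str, user_address: str) -> str:
    """Hypothetical response: set aside mail not addressed to the user."""
    if addressed_explicitly(raw_message, user_address):
        return "inbox"
    return "possible-spam"


bulk = (
    "From: seller@example.com\n"
    "To: Bulk List <bulk-list@example.com>\n"
    "Subject: MAKE MONEY FAST\n"
    "\n"
    "Reply now.\n"
)
personal = (
    "From: bob@example.com\n"
    "To: Alice <Alice@Example.org>\n"
    "Subject: Tuesday\n"
    "\n"
    "During my meeting with you last Tuesday\n"
)
print(choose_folder(bulk, "alice@example.org"))      # prints possible-spam
print(choose_folder(personal, "alice@example.org"))  # prints inbox
```

A rule this simple also sets aside legitimate mailing-list traffic, which is one way the false positives associated with header-based detection arise.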
Ross Rader of Internet Direct (Idirect) has published directions for setting up simple rules based on this characteristic of spam for a variety of popular e-mail programs, including Eudora Light, Microsoft Mail and Pegasus. When a mail message header matches the rule, that mail is automatically removed from the user's inbox and placed in a special folder where it can be examined later or easily deleted without inspection.

However, unless the user of this method puts a great deal of effort into personalizing these detection rules the false positive rate has the potential to be quite high so that a large proportion of legitimate e-mail will be classified as spam.

Text-based Keyword Detection

Spam is typically distinguished from ordinary e-mail in that it aggressively tries to sell a product, advocate visiting a pornographic web site, enlist the reader in a pyramid scheme or other monetary scam, etc. Thus, a piece of mail containing the text fragment "MAKE MONEY FAST" is more likely to be spam than one that begins "During my meeting with you last Tuesday."

Some anti-spam methods scan the body of each e-mail to detect keywords or keyphrases that tend to be found in spam, but not in other e-mail. The keyword and keyphrase lists are often customizable. This method is often combined with the domain- and header-based detection techniques described hereinabove. Examples of this technology include junkfilter (http://www.pobox.com/gsutter/junkmail), which works with procmail, Spam Exterminator and SPAM Attack Pro!. Again, false positives may occur when ordinary e-mail messages contain banned keywords or keyphrases. This approach is prone to false negatives as well because the list of banned keyphrases would have to be updated several times per day to keep up with the influx of new instances of spam, and this is both technically difficult for the anti-spam vendor and unpalatable to the user.

Text-based Machine Classification

Spam Be Gone! is a freeware product that works with Eudora. It uses an instance-based classifier that records examples of spam and non-spam e-mail, and measures the similarity of each incoming e-mail to each of the instances, combining the similarity scores to arrive at a classification of the e-mail as spam or non-spam. The classifier is trained automatically for each individual user. It typically takes the user several weeks to a few months to develop a classifier. After a sufficient amount of training, the false positive and false negative rates for this approach are claimed to be lower than for other techniques. In one cited case (http://www.internz.com/SpamBeGone/stats.html), which can be assumed to be an upper bound on the performance since an average over several users is not provided, the false negative rate was less than a few tenths of a percent after one or two months of training, while the false positive rate was 20% after one month and 5% after two months. Thus, even in the best case, 1 of every 20 messages labeled as spam will, in fact, be legitimate. This could be unacceptable, particularly if the anti-spam software responds in a strong manner, such as automatically deleting the mail or returning it to the sender.

All of the above UCE detection methods are "generic" in the sense that they use features that are generic to spam but much less common in ordinary non-spam e-mail. This is in contrast to "specific" detection techniques that are commonly employed by anti-virus programs to detect specific known computer viruses, typically by scanning host programs for special "signature" byte patterns that are indicative of specific viruses. Generic recognition techniques are attractive because they can catch new, previously unknown spam. However, as indicated hereinabove, their disadvantage is that they tend to yield unacceptably high false positive rates and, in some cases, unacceptably high false negative rates as well. Specific detection techniques typically have lower false positive and false negative rates, but require more frequent updating than do generic techniques.

Generic detection techniques are even less likely to be useful for other types of undesirable e-mail, such as hoaxes, chain letters and improper forwarding of confidential e-mail. Recognition based on the sender's domain or other aspects of the mail header is unlikely to work at all. Generic recognition of hoaxes and chain letters on the basis of keywords or keyphrases present in the message body may be possible, but is likely to be more difficult than for spam because the range in content is likely to be broader. Generic recognition of confidential e-mail on the basis of text is almost certainly impossible because there is nothing that distinguishes confidential from non-confidential text in a way that is recognizable by any machine algorithm.

Bright Light Technologies promotes a different anti-spam product/service. Bright Light uses a number of e-mail addresses (or "probes") throughout the Internet which, in theory, receive only undesirable messages since they are not legitimate destinations. The messages received are read by operators located at a 24-hour a day operations center. These operators evaluate the messages and update rules which control a spam-blocking function in a mail server that serves a group of users.

While this method of UCE detection and response is inherently less vulnerable to false positives and false negatives because it uses specific rather than generic detection, it suffers from some drawbacks. Many of these stem from the considerable amount of manual effort required to maintain the service. The Bright Light operations center must employ experts who monitor streams of e-mail for spam, manually extract keywords and keyphrases that they believe to be good indicators of specific instances of spam, and store these keywords or keyphrases in a database. As it would most likely be prohibitive for any company to support such a set of experts on its own, any company wishing to protect itself in this way would be entirely dependent on continued, uninterrupted service by Bright Light's operations center. At least some companies might well prefer a solution that allows for greater freedom from an external organization, and greater customization than is likely to be achieved by a single organization. The crux of the problem is that Bright Light's method couples two tasks that ought to be independent of one another: labeling a message as undesirable, and extracting a signature from the undesirable message. If it were possible to reduce the requirement for manual input to that of labeling undesirable messages, this would enable localized collaborative determinations of undesirable messages. Furthermore, Bright Light does not describe a process by which experts extract auxiliary data that permit possible matches based on keywords or phrases to be tested more stringently by exact or approximate matching to entire specific messages (or large portions of them). Thus their specific solution is likely to be more vulnerable to false positives than one in which individual users would have the opportunity to specify more stringent conditions for message matching.

Another drawback is that the Bright Light solution is specifically targeted at UCE, as opposed to the broader class
of undesirable messages that includes hoaxes, chain letters, and improperly forwarded confidential messages. Taken together, probe accounts may receive a reasonable fraction of all UCE, but it is unclear that they would attract chain letters and rumors.

It is, thus, an object of the present invention to provide an automatic, non-generic procedure for detecting and handling instances of all types of undesirable mail, with very low false positive and false negative rates.

A further object of the present invention is to provide an inexpensive solution which involves no staffing, but rather utilizes the users themselves to actively identify UCE.

A still further object of the present invention is to provide a system and method for preventing the undesired transmission and/or receipt of confidential e-mail messages.

SUMMARY OF THE INVENTION

The present invention provides an automated procedure for detecting and handling UCE and other forms of undesirable e-mail accurately, with low false negative rates and very low false positive rates. In contrast to existing generic detection methods, the present invention uses a specific detection technique to recognize undesirable messages. In other words, the system of the present invention efficiently detects undesirable messages on the basis of their exact or close matches to specific instances of undesirable messages. In contrast to the specific technique used by Bright Light, the character strings used to identify specific undesirable messages are derived completely automatically, and are supplemented with auxiliary data that permit the end user to tune the degree of match required to initiate various levels of response. A further point of contrast is that the automatic derivation of signature data permits greater flexibility because the only required manual input is the labeling of a particular message as undesirable. This permits ordinary users to work collaboratively to define undesirable messages, freeing them from dependence on an external, centralized operations center where experts must manually label and extract signatures from undesirable messages. It also permits authorities on hoaxes and chain letters to identify messages containing them, without further imposing the burden of extracting a signature, which would require a very different sort of expertise. Another point of contrast is that the extracted signature data can permit users to define independent, flexible definitions of what constitutes a given level of match, ranging from matching a signature to matching an entire message verbatim.

The method of the present invention includes, when a first ("alert") user receives a given instance of undesirable mail, labeling the message as undesirable, extracting a signature for the message, adding the signature to a signature database, periodically scanning a second (possibly including the same) user's messages for the presence of any signatures in the database, identifying any of the second user's messages that contain a signature as undesirable and responding appropriately to any messages so labeled.

Specifically, the method of hindering an undesirable transmission or receipt of electronic messages within a network of users, includes the steps of: determining that transmission or receipt of at least one specific electronic message is undesirable; automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and taking appropriate action, responsive to the scanning step. Preferably, the method further includes the step of storing the extracted detection data.

Preferably, the determining step comprises the step of receiving notification that proliferation of the at least one specific electronic message is undesirable. The receiving step preferably includes the step of receiving a signal from an alert user identifying the at least one specific electronic message as undesirable or confidential. The at least one specific electronic message can be received in an inbox of the alert user. The receiving step preferably includes the step of providing an identifier for the alert user to indicate that the specific electronic message is to be flagged as undesirable. It is preferable that the providing step comprises the step of providing a generic detector to aid in identification of desirability of electronic messages.

The extracting step of the present invention preferably includes the step of extracting, from the at least one specific electronic message, signature information. The storing step preferably comprises the step of adding, responsive to the scanning step, information pertaining to the at least one specific electronic message to the signature information. The signature information preferably includes a signature from the at least one specific electronic message. The storing step can include the step of storing the signature in at least one signature database. The signature database preferably comprises a plurality of signature clusters, each cluster including data corresponding to substantially similar electronic messages. Each of the signature clusters preferably comprises a character sequence component having scanning information and an archetype component having identification information about particular signature variants. The scanning information preferably includes a search character sequence for a particular electronic message and extended character sequence information for all the electronic messages represented in the cluster, and the identification information includes a pointer to a full text stored copy of the electronic message corresponding to a particular signature variant, a hashblock of the electronic message, and alert data corresponding to specific instances where a copy of the electronic message was received and the proliferation of which was reported as undesirable by an alert user.

The extracting step and the scanning step of the present invention can occur simultaneously and asynchronously across the network of users.

The method of the present invention can further include the step of confirming, before the scanning step, the undesirability of the at least one specific electronic message. The confirming step preferably comprises the step of confirming, with a generic detection technique, the undesirability of the at least one specific electronic message. In the method of claim 16, the confirming step comprises the step of requiring that a predetermined threshold number of users signal that the at least one specific electronic message is undesirable.

The extracting step preferably comprises the steps of: scanning the specific electronic message for any signatures in the at least one signature database; and comparing, responsive to finding a matching signature in the scanning step, the matching signature to each message variant in a matching cluster. The comparing step preferably comprises the steps of: computing a hashblock for the specific electronic message; and comparing the computed hashblock with variant hashblocks in the identification information of each archetype component. It is preferable that the method of the present invention further comprise the steps of: if an exact variant hashblock match is found, retrieving the full
text stored copy of the variant match using the pointer, and if the full text stored copy of the variant match and the full text of the specific electronic message are deemed sufficiently similar to regard the specific electronic message as an instance of the variant, extracting alert data from the specific electronic message and adding it to the alert data for the variant match; else if an exact variant hashblock match is not found or the full text of the specific electronic message is found to be insufficiently similar to any of the variants in the database, determining whether the specific electronic message is sufficiently similar to any existing cluster; if the specific electronic message is sufficiently similar to an existing cluster, computing new identification information associated with the specific electronic message; else if the specific electronic message is not determined to be sufficiently similar to an existing cluster, creating a new cluster for the specific electronic message. The determining step preferably comprises the steps of: computing a checksum of a region of the specific electronic message indicated in the extended character sequence information for each cluster; and comparing the computed checksum with a stored checksum in the extended character sequence information of each cluster. The method preferably further comprises the step of creating, if no signature match is found, a new cluster for the specific electronic message. The extended character sequence information preferably includes a beginoffset field, a regionlength field and a CRC field, the method further comprising the steps of: determining, for each cluster, a matching region with a longest regionlength; and identifying, if the longest regionlength among all the clusters is at least equal to a specified threshold length, a longest regionlength cluster as an archetype cluster to which the specific electronic message archetype is to be added. Finally, the method of the present invention preferably comprises the step of recomputing the scanning information of the identified cluster. The alert data preferably includes a receivetime field having a time at which a copy was originally received and wherein the method further comprises the steps of: periodically comparing the receivetime field of all variants of each signature cluster with the current time; and removing a signature cluster in which none of the receivetime fields are more recent than a predetermined date and time.

The scanning step preferably comprises the steps of: extracting a message body; transforming the message body into an invariant form; scanning the invariant form for exact or near matches to the detection data; and determining, for each match, a level of match.

The taking step preferably comprises the step of taking appropriate action, upon discovering the presence of the at least one specific electronic message or variants thereof. The taking step can comprise the step of labeling the at least one specific electronic message or variants thereof as undesirable or confidential. The taking step also can comprise the step of removing the at least one specific electronic message or variants thereof.

The taking step preferably comprises the step of taking appropriate action for each determined level of match, responsive to one or more user preferences and the determining step preferably comprises the steps of: finding the longest regional matches for each match; computing hashblock similarities between a hashblock of the scanned message and hashblocks of each of the extracted detection data; receiving one or more user preferences; and determining a level of match responsive to the finding, computing and receiving steps.

The present invention also includes a program storage device, readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for hindering an undesirable transmission or receipt of electronic messages within a network of users, the method comprising the steps of: determining that transmission or receipt of at least one specific electronic message is undesirable; automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and taking appropriate action, responsive to the scanning step.

Finally, the present invention also includes a system for hindering an undesirable transmission or receipt of electronic messages within a network of users, comprising: means for determining that transmission or receipt of at least one specific electronic message is undesirable; means for automatically extracting detection data that permits detection of the at least one specific electronic message or variants thereof; means for scanning one or more inbound and/or outbound messages from at least one user for the presence of the at least one specific electronic message or variants thereof; and means for taking appropriate action, responsive to the scanning means. Otherwise, the preferable embodiments of the system match those of the method of the present invention.

BRIEF DESCRIPTION OF THE DRAWING

The present invention will be understood by reference to the drawing, wherein:

FIG. 1 is a block diagram of a computer system for practicing the teaching of the present invention;

FIG. 2 is a schematic diagram of a system environment in which an embodiment of the present invention is applied;

FIG. 3 is a schematic diagram of a signature data structure of an embodiment of the present invention;

FIG. 4 is a flow diagram of the signature extraction phase of an embodiment of the present invention;

FIG. 5 is a flow diagram of details of a signature extraction procedure of an embodiment of the present invention; and

FIG. 6 is a flow diagram of the signature scanning phase of an embodiment of the present invention.

Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiment. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with preferred embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

FIG. 1 is a block diagram of a system 10 that is suitable for practicing the teaching of the present invention. A bus 12 is comprised of a plurality of signal lines for conveying addresses, data and controls between a central processing unit (CPU) 14 and a number of other system bus units. A random access memory (RAM) 16 is coupled to the system bus 12 and provides program instruction storage and working memory for the CPU 14. A signature extraction module and a scan/filter module 15, the methods of which are
described hereinbelow, can run on CPU 14 or, alternatively, on separate CPUs. A terminal control subsystem 18 is coupled to the system bus 12 and provides outputs to a display device 20, typically a CRT or LCD monitor, and receives inputs from a manual input device 22, such as a keyboard or pointing device. A hard disk control subsystem 24 bidirectionally couples a rotating fixed disk, or hard disk 26, to the system bus 12. The control 24 and hard disk 26 provide mass storage for CPU instructions and data. A floppy disk control subsystem 28 which, along with a floppy disk drive 30, is useful as an input means in the transfer of computer files from a floppy diskette 30a to system memory, bidirectionally couples the floppy drive 30 to the system bus 12. Finally, a communication subsystem 32 is coupled to the system bus 12 and provides a link to networks such as the Internet.

The components illustrated in FIG. 1 may be embodied within a personal computer, a portable computer, a workstation, a minicomputer or a supercomputer. As such, the details of the physical embodiment of the data processing system 10, such as the structure of the bus 12 or the number of CPUs 14 that are coupled to the bus, are not crucial to the operation of the present invention, and are not described in further detail hereinbelow.

In broad terms, the method of the present invention comprises two phases. First, in a signature extraction phase, an undesirable (or confidential) message that is currently unrecognized as such by the system is labeled as undesirable (or confidential) by a first alert user, perhaps assisted by an automated procedure, and certain signature data are automatically extracted from that message and placed in one or more databases distributed to the user population. Second, in

messages 1 and 2 while she reads message 3 and labels it undesirable. Minutes later, user B's messages may be scanned for the presence of undesirable messages 1, 2 and 3. Half an hour later, user C may discover a fourth undesirable message 4, and an hour later user A's messages may be scanned again, this time for the presence of 1, 2, 3 and 4.

The present invention provides for the scanning of outbound messages as well as inbound messages. This is particularly advantageous for types of messages that are likely to be forwarded from one user to several other users, such as hoaxes, chain letters, and confidential messages. Catching an undesirable outbound message before it can be forwarded is considerably more efficient than dealing with the message after it has been sent to what could be a multitude of recipients.

A preferred data structure for representing signature data that are extracted from a message in the first phase of the invention [...] to recognize a duplicate or similar message [...]. Those skilled in the art will appreciate that more or less elaborate data structures may be used in the present invention. Undesirable messages are clustered into sets of substantially similar messages. Within each cluster, there may be one or more archetypes. In many cases, each cluster will contain just a single archetype. However, under some circumstances (particularly for hoaxes, which may come in several related variants) it may be useful to regard slight variants of a message as belonging to the same cluster. Allowing for more than one archetype in a cluster enables the same signatures to be used to detect several different variants. This results in more efficient storage and somewhat faster scanning; it also makes it more likely that new variants
a signature scannmg phase, at least one user's set of mes- ^ ^ recognized as sucht Furthermore, the sophisticated 
sages (possibly including the first alert user's set) is scanned ^ natUfe of the signature extraction data of the present inven- 
ting the extracted signature data in an effort to find ^ provides for flexibility in tuning the system so that a 
instances of the substantially similar messages, and an trade-off is made between detecting variants and reducing 
appropriate action is taken whenever such messages are false-positives. 

encountered. A s i gnature database of one embodiment of the present 

FIG. 2 shows a computer system environment in which 4Q mve ntion consists of a set of archetype Clusters, each 

one embodiment of the present invention that specifically distinguished by a unique ClusterlD identifier. Each Cluster 

addresses spam is applied. A spammer 200 transmits spam 300 has two basic components. The first component is 

202 to company A 204 and company B 206. In practice, the sigList 302. SigList 302 is a list of SigData elements 304, 

spam 202 would be sent to many different companies. eacn 0 f wn i cn contains information pertaining to specific 

Assuming that company A204 utilizes the present invention, 45 c h aracter sequences found in members of the archetype 

the spam 202 could be received at the mail server 208 in cluster 300. Three SigData elements, SigDatal, SigData2 

which one or more users maintain accounts. Assuming that an d SigData3, are shown. Each SigData element 304 in the 

user A 210 accesses his/her mail, the spam 202 is found in SigList 302 contains two parts. For illustration, only Sig- 

his/her list of incoming mail. In response to user A 210 Data 2 ^ expanded. The first part of SigData2 304, Sig2 306, 

identifying the spam 202 as such, the identified spam 212 is 5Q ^ a re i at ively short textual pattern that will be searched for 

labeled as such and the signature extraction phase of the by the message scanner. The second part, RegionList2 308, 

present invention is commenced. ^ a ii st 0 f RegjonData elements 310 associated with Sig2 

In the signature extraction phase of the present invention, 306, each of which contains information about a longer 

the identified spam 212 could be forwarded by the mail character sequence contained in all archetypes in the cluster, 

server 208 to a signature extraction engine 214. Once 55 Each RegionData element 310 contains three elements: 1) 

extracted by the signature extraction engine 214, the signa- BeginOffset 312, an offset in bytes of the beginning of the 

ture of the identified spam 212 is returned to the mail server character sequence from the beginning of the signature; 2) 

208 and stored in a signature database 216. In the signature RegionLength 314, the number of characters in the character 

scanning phase of the present invention, the incoming (or sequence; and 3) CRC 316, a checksum of the character 

outgoing) messages of user B 218 and user C 220 are 6 q sequence. 

scanned using the extraction signature data in the signature The second component of each Cluster 300 is Arche- 

database 216. Here, instances of substantially similar mes- typeList 318. ArchetypeList 318 is a list of ArchetypeData 

sages 222 are flagged for the users, eliminated from their elements 320, each of which contains data pertaining to a 

inboxes or prevented from being transmitted. particular archetype. In particular, each ArchetypeData ele- 

The two phases may operate simultaneously and asyn- 65 ment 320 may contain: 1) ArchetypePtr 322, which is a 

chronously across a user population. For example, user A pointer to a stored copy of an archetype message so that its 

could have his messages scanned for known undesirable full text can be retrieved as needed; 2) HashBlock 324, 
which is a block of data computed from the body of the archetype, and used to measure overall similarity to other messages; and 3) CaseList 326, which is a list of CaseData elements 328, each of which contains data pertaining to specific instances where a copy of the archetype was received and reported as undesirable by a user. In particular, each CaseData element 328 may contain: 1) SendID 330, which is the identity of the sender of the copy; 2) RecvID 332, the identity of the recipient who reported the copy; and 3) RecvTime 334, the time at which the copy was originally received.

Signature Extraction

A preferred embodiment of the signature extraction phase of the present invention, during which a method for detecting a specific, previously unknown undesired (or confidential) message is derived and disseminated to a network of users, is described with reference to FIG. 4. The present invention can be used in an environment with one or more mail users. As the number of mail users increases, the advantages of the present invention increase. In step 400, a first (alert) user receives a message M1. The user reads the received message M1 and, if he believes it to be "undesirable" in the sense that it is likely to be widely circulated and widely held to be unwelcome (or that it is confidential), that user indicates to the system that the message M1 is to be flagged as undesirable (or confidential), e.g., by clicking a special button in the user interface. Optionally, a generic detection method may be used to help the user identify the message as undesirable in the first place. In any case, if the user has indicated to the system that the message should be flagged as "undesirable" at step 402, a copy of the message M1 is sent and/or input to an automatic signature extraction procedure in step 404. Optionally, in step 403, identification of the message as undesirable can be confirmed in a number of ways. The confirmation could be provided by an authorized human user. It could be given only after a threshold number of users have all labeled that message as undesirable. Finally, it could be provided by a separate automated process (e.g. one that uses a generic technique to detect spam). If confirmation that the message is undesirable is provided, the method would continue at step 404. By permitting the mail system users themselves to identify the undesirable or confidential messages, dependence upon experts at a centralized operations center is avoided.

At step 404, the message M1 is scanned for the presence of any signatures contained in a master signature database D1. If, at step 405, the message M1 is found to contain at least one of the signatures in the master signature database D1, then at step 406, the message is compared with each archetype associated with each Cluster that contains a matching signature in one of its Sig components to determine if a match with any archetype in D1 exists. A preferred method of comparison is to compute a HashBlock for the message and to compare this HashBlock with the HashBlock for each candidate archetype. If an exact archetype match is found (e.g., if the HashBlock distance is computed to be zero), then the matching candidate's ArchetypePtr 322 is used to retrieve its full text. Finally, if the full texts of the archetype and the message are deemed sufficiently similar to regard the message as an instance of the archetype, then at step 408, the relevant CaseData information 328 is extracted from the message and added to the CaseList 326 in D1 for that archetype. Control then passes to step 418. However, if at step 406, an exact archetype match is not found or the full text of the message is determined to be insufficiently similar to the full text of the archetype, then at step 410, a determination is made as to whether the new archetype is sufficiently similar to an existing cluster of archetypes and, if so, which cluster. Preferably, for each Cluster that contains a matching signature in one of its Sig components, each RegionData element 310 in the RegionList 308 associated with that Sig 306 is compared with the message M1 by computing the checksum of the region indicated by BeginOffset 312 and RegionLength 314, and a match is declared if the checksum of that region within the message is equal to the value stored in CRC 316. The matching region with the longest RegionLength 314 is determined for each Cluster. If the longest RegionLength 314 among all Clusters is at least equal to a specified threshold length, then the Cluster with the longest RegionLength 314 is identified as the archetype cluster to which the new archetype should be added. Thus, at step 412, the archetype data are computed and added, as a new ArchetypeData element (with all substructures filled with the required information), to this Cluster's ArchetypeList.

Optionally, at step 414, the Cluster's SigList 302 may be recomputed to reflect the addition of a new archetype to the cluster. A matching algorithm (such as a suffix array routine) can be used to identify one or more sequences of characters found among all of the archetypes, and the derivation of the SigList data detailed hereinbelow with reference to FIG. 5 can be applied only to the set of commonly occurring character sequences, rather than to the entire message body. The method continues in step 418.

If at step 405, the message M1 is found to contain none of the signatures in the master signature database D1, or if no cluster is found to be sufficiently close to the new archetype at step 410, then the method continues at step 416. At step 416, a new archetype cluster is created for the message. First, an ArchetypeData element containing the required information is created and placed in the ArchetypeList of a new Cluster. Then the signatures for the archetype are computed and placed in SigList. Finally, the archetype Cluster is assigned its unique ClusterID and added to the master signature database D1. The signatures in SigList are computed automatically by an automatic signature extraction procedure that selects character sequences that are unlikely to be found in other messages. A preferred method for this procedure is described hereinbelow with reference to FIG. 5. A signature consists of a sequence of characters, or more generally a pattern of characters, found in the message itself or in a preprocessed version of the message. It may be accompanied by additional information such as checksums of the entire message and/or portions of it, checksums or other compressed data strings derived from one or more transformations of the message. This additional information may be stored in the RegionList 308 associated with each signature as illustrated in FIG. 3.

Finally, in step 418, local signature databases serving one or more individual user nodes are updated to reflect the updates that were made to the master signature database D1 at steps 408, 414 or 416. This can be achieved by standard database updating or replication techniques to ensure that the local databases are exact replicas of the master signature database, or by selectively sending or selectively receiving and incorporating signatures and associated auxiliary data according to a set of criteria that may vary across local signature databases.

Derivation of SigList Data

A preferred embodiment of the procedure for extracting or computing the SigList data for a given archetypal message,
employed in steps 414 and 416, is now described with reference to FIG. 5. First, at step 500, the number of occurrences of all byte sequences less than or equal to a chosen threshold length within a corpus of mail messages is tallied. In a preferred embodiment, the threshold length is three, i.e. the number of occurrences of all 1-, 2-, and 3-byte sequences (referred to as 1-grams, 2-grams and 3-grams, respectively) is tallied. In step 501, the number of occurrences tallied is then stored in compressed form in an n-gram frequency database. The n-gram frequency database requires no more than a few megabytes. The database could be computed for each user individually from a corpus consisting of archived messages received by that user, or a universal database could be computed from a standard corpus of generic messages culled from several users. This universal database could then be distributed throughout the user population. The database could be updated periodically. Details of where the database is originally produced and how frequently it is updated have no bearing on the remaining steps of the signature extraction procedure.

At step 502, the body of the message M2 from which the signature is to be extracted is isolated. At step 504, the extracted body is transformed into an "invariant" form by removing all non-alphanumeric characters and replacing all uppercase letters with their lowercase versions (see FIG. 6). Next, at step 506, one or more sequences of characters that are highly unlikely to be found in a typical message are identified. The one or more sequences constitute the signature or signatures. The identification of unlikely character sequences can be carried out by the method described in U.S. Pat. No. 5,452,442 (442 patent) entitled "Methods and Apparatus for Evaluating and Extracting Signatures of Computer Viruses and Other Undesirable Software Entities," issued Sep. 19, 1995, which is hereby incorporated by reference. This method was originally applied to the automatic extraction of computer virus signatures. Several candidate signatures taken from the message are selected, and for each, the n-gram statistics from the n-gram frequency database are combined using formulas found in the 442 patent to estimate the likelihood for each candidate signature to appear in a random ordinary mail message. The candidate signature or signatures with the least likelihood of appearing in an ordinary mail message are selected.

Taken together, steps 502, 504 and 506 describe the derivation of the text string element labeled Sig 306 in FIG. 3. Optionally, the false positive rate may be reduced further by computing a list of RegionData 310 associated with Sig 306. This may be achieved at step 508 by the following procedure for each derived signature. A series of "regions," each consisting of a character sequence that contains the signature, is chosen. In a preferred embodiment, the series consists of a first region that is roughly centered on the signature and approximately twice the length of the signature, a second region that contains the first region and is roughly twice the size of the first, and so on until the final region in the series consists of the entire transformed message body. For each region, the offset of its first character from the first character of the signature (typically a negative integer) is recorded, along with the length of the region and a checksum of the region's character sequence. These three elements constitute the RegionData 310 for that region. The checksum may employ any convenient method, such as a cyclical redundancy check, and preferably should be at least 32 bits.

Deriving HashBlock Data

A preferred embodiment of the method for computing the HashBlock data for a given message, as required in steps 412 and 416, is now described. First, the message body is transformed. The transformation may be the same as or different from the transformation applied to the message body prior to signature extraction (step 504). For example, the transformations could be identical, except that blank spaces would be retained in the transformed message body for purposes of computing the HashBlock. Then, the transformed message body is divided into small individual units that may or may not overlap. For example, the individual units could be all 5-character sequences (which overlap), or they could be non-overlapping "words" (individual units delimited by blank spaces). Non-overlapping units are preferable. For each individual unit, a hash function maps that unit to a small integer hash value (say in the range 0-255). An array of hash value counts is kept, and each time a particular hash value is computed, the count for that value is incremented by 1. If the number of counts is capped at 15 or, alternatively, if it is computed modulo 16 (that is, the recorded number is the remainder of the actual number when divided by 16) then only 4 bits are required for each count, and an array of 256 hash values can be expressed as a HashBlock of just 128 bytes. Note that this HashBlock will be relatively insensitive to additions, deletions and rearrangements of words, provided that the number of changes is not too great.

Pruning of the Signature Databases

In order to prevent unlimited growth of the master and local signature databases, they may be pruned periodically to remove cluster data for which there have been no recent reported instances. Preferably, at periodic intervals (daily, for example), each Cluster in the master signature database is examined. All RecvTime elements 334 in the cluster structure are compared with the current time, and if none are more recent than some specified date and time, then the entire cluster is removed from the master signature database. The removal of this cluster is communicated to all local signature databases, and any that include this cluster can eliminate it as well.

Signature Scanning

During the signature scanning phase of the invention, one or more messages are scanned for the possible presence of specific messages that have been labeled as undesirable (or as confidential). Although hundreds, thousands or even millions of users may be protected by the present invention, it is most convenient to focus on an individual user. The scanning procedure uses a local signature database that is continually updated as new undesirable messages are discovered by other users, and may be specific to a particular user or shared by several users. The scan may take place periodically, or in response to a request by the user or some other event (such as a notification that the local signature database has been updated since the last scan). Furthermore, the scan may take place at different times and under different circumstances for different users. In the typical case in which the messages are electronic mail, the scan is applied preferably only to those items that are in the user's inbox, although it may be applied to other specified folders as well if the user so desires.

A preferred embodiment of the scanning procedure is described with reference to FIG. 6. At step 602, the body of the message M2 to be scanned is extracted. Then, at step 604, the message body is transformed into the same invariant form as was applied at step 504. At step 606, the invariant form of the message body is scanned for exact or
near matches to any of the signatures included in a local signature database D2, which has been constructed from all or a portion of the Cluster data structures in one or more master signature databases. If no signatures are found, the message is not deemed undesirable (or confidential), and the process terminates.

However, if one or more signatures are found at step 606, then at step 608, the auxiliary information contained in the associated RegionData elements 310 is used to assess the degree of match to one or more known undesirable messages. Specifically, for each signature Sig 306 appearing in the message, all Clusters in which Sig 306 appears are considered in turn. For each such Cluster 300, the RegionList 308 associated with Sig 306 is considered. First, the RegionData element 310 with the largest RegionLength 314 is checked by computing the checksum of the corresponding region within the scanned message. If the checksum matches the CRC 316 for this RegionData element 310, this RegionData element 310 and the associated ClusterID are added to a list BestRegionDataElements, and the next Cluster is then considered. If the checksum does not match, the RegionData element 310 with the next longest RegionLength 314 is compared in the same way, and so on until a matching checksum is found. If there is no matching checksum among the RegionData elements 310, then the signature itself and the associated ClusterID are added to the BestRegionDataElements list, and the next Cluster is considered.

At step 610, a locality-preserving hash function is used to compute a HashBlock for the scanned message. The HashBlock of the scanned message is compared with the HashBlocks of each Cluster that contains one of the matching signatures found at step 606, and a similarity is computed for each such Cluster. The similarity computation may employ any reasonable metric. A preferred similarity metric for two HashBlocks (H1 and H2) treats each as a 256-element array, each element being represented as 4 bits, and sums the absolute values of the differences between the array elements, i.e. the similarity S is given by

$$S = \sum_{i=0}^{255} \left| H1_i - H2_i \right| \tag{1}$$

if the array elements are capped at 15, and alternatively by

$$S = \sum_{i=0}^{255} \left( (H1_i - H2_i + 16) \bmod 16 \right) \tag{2}$$

if the array elements are stored modulo 16.

The ClusterID and the similarity S are added to a list HashBlockSimilarity, and then the next Cluster is considered until there are no more Clusters that contain one of the matching signatures found at step 606.

At step 612, the BestRegionDataElements list derived from step 608, the HashBlockSimilarity list derived from step 610 and a set of user preferences are combined to determine a degree or level of match. The user preferences may consist of one or more thresholds for HashBlock similarity, one or more thresholds for RegionLength 314, and conditions on various aspects of the MsgData component of the Cluster referred to in the BestRegionDataElements and HashBlockSimilarity lists. In a typical application, the user preferences may be set at some default settings which may be overridden by advanced users, if they choose.

As an explicit example, suppose that there are four discrete levels of match: perfect, high, medium and low. Then a reasonable set of user preferences might be as follows. For a match level to be regarded as perfect, there must exist a Cluster for which the HashBlock similarity distance is zero, and for which at least two users in the MsgList for that Cluster have a RecvID 332 within the same e-mail domain as the user. Otherwise, for a match level to be regarded as high, there must exist a Cluster for which the HashBlock similarity distance is less than 5 or the longest region length in BestRegionDataElements is at least 500 characters, and for which at least two users in the MsgList for that Cluster have a RecvID 332 within the same e-mail domain as the user. Otherwise, for a match level to be regarded as medium, there must exist a Cluster for which the longest region length is at least 100 characters, and for which there are at least two distinct users in the MsgList, with no restrictions on domain or other characteristics. Otherwise, the match level is to be regarded as low.

At step 614, another set of rules within the user's set of preferences is applied to the level of match determined at step 612 to determine and carry out the appropriate response. Appropriate responses may include automatically deleting the message, altering its appearance in the user's inbox (for example by annotating or colorizing it), storing it in a special folder, etc. For example, if the match level is perfect, the user may indicate that the mail should be automatically deleted; if the match level is high, the mail should be placed in a special "probable spam" folder; if the match level is medium, the mail summary appearing in the inbox should be colored and the message body should be prefixed with a brief explanation of why the message is believed to be closely related to a known instance of undesirable mail. The user's preferences may also specify particular messages that, regardless of their level of match, are not to be regarded as undesirable (such as ones sent by the user's manager or the company's chief executive officer).

Optionally, if an undesirable message has been discovered, then at step 616 the master signature database may be updated with information about the new instance of the undesirable message. The update may occur upon discovery, or alternatively may occur only after the user has confirmed that the message is undesirable. For example, in the case of a perfect match, the information may consist of CaseData 328 for the undesirable message (i.e. the identity of the sender and receiver and the time of receipt). This information could be extracted locally and then sent to the location of the master signature database, where it would be incorporated. In the case of a high or even a medium level of match, the entire message might be sent to the location of the master signature database, and it would enter the signature extraction phase at step 404, where an attempt would be made to create a new archetype and place it in an appropriate archetype cluster.

Now that the invention has been described by way of a preferred embodiment, various modifications and improvements will occur to those of skill in the art. Thus, it should be understood that the preferred embodiment is provided as an example and not as a limitation. The scope of the invention is defined by the appended claims.

What is claimed is:

1. A method of hindering an undesirable transmission or
receipt of electronic messages within a network of users,
comprising the steps of:

determining that transmission or receipt of at least one
specific electronic message is undesirable;

automatically extracting detection data that permits detection
of the at least one specific electronic message or
variants thereof, wherein said automatically extracting
includes automatically identifying and storing a text
string signature contained within an undesirable electronic
message, said text string signature being statistically
unlikely to be found in desirable electronic
messages;

scanning one or more inbound and outbound messages
from at least one user for the presence of the at least one
specific electronic message or variants thereof wherein
said scanning includes searching for said text string
signature within said inbound and outbound messages;
and

taking appropriate action, responsive to the scanning step.
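
By way of illustration only, the text string signature search recited in claim 1 can be sketched in Python as follows. The sketch assumes the "invariant" transformation described for step 504 (non-alphanumeric characters removed, uppercase letters lowercased) and a plain substring search; the function names `to_invariant` and `find_signatures` are hypothetical and do not appear in the specification, which prescribes no particular implementation.

```python
def to_invariant(body: str) -> str:
    """Reduce a message body to an invariant form (assumed reading of step 504):
    drop every non-alphanumeric character and lowercase what remains."""
    return "".join(ch.lower() for ch in body if ch.isalnum())


def find_signatures(body: str, signatures: list[str]) -> list[str]:
    """Return the signatures that occur in the invariant form of the body.

    The signatures are assumed to have been extracted from messages that
    were transformed in the same way, so an exact substring test suffices
    for this sketch. Near-match scanning is not attempted here.
    """
    invariant = to_invariant(body)
    return [sig for sig in signatures if sig in invariant]


if __name__ == "__main__":
    # Hypothetical signatures, already in invariant form.
    known = ["makemoneyfast", "freeoffer"]
    hits = find_signatures("Make M-O-N-E-Y fast", known)
    print(hits)
```

Because punctuation, spacing and case are discarded before the search, trivial obfuscations of a known message do not defeat the scan in this sketch; the trade-off is that the signature must be long and unusual enough to avoid false positives.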

2. The method of claim 1 further comprising the step of 
storing the extracted detection data. 

3. The method of claim 1 wherein the determining step 
comprises the step of receiving notification that proliferation 
of the at least one specific electronic message is undesirable. 

4. The method of claim 3 wherein the receiving step 
comprises the step of receiving a signal from an alert user 
identifying the at least one specific electronic message as 
undesirable or confidential. 

5. The method of claim 4 wherein the at least one specific 
electronic message is received in an inbox of the alert user. 

6. The method of claim 4 wherein the receiving step
comprises the step of providing an identifier for the alert user
to indicate that the specific electronic message is to be
flagged as undesirable.

7. The method of claim 6 wherein the providing step
comprises the step of providing a generic detector to aid in 
identification of undesirability of electronic messages. 

8. The method of claim 2 wherein the extracting step 
comprises the step of extracting, from the at least one 
specific electronic message, signature information. 

9. The method of claim 8 wherein the storing step
comprises the step of adding, responsive to the scanning 
step, information pertaining to the at least one specific 
electronic message to the signature information. 

10. The method of claim 2 wherein the extracting step 
comprises the step of extracting a signature from the at least 
one specific electronic message. 

11. The method of claim 10 wherein the storing step 
comprises the step of storing the signature in at least one 
signature database. 

12. The method of claim 11 wherein the signature data- 
base comprises a plurality of signature clusters, each cluster 
including data corresponding to substantially similar elec- 
tronic messages. 

13. The method of claim 12 wherein each of the signature 
clusters comprises a character sequence component having 
scanning information and an archetype component having 
identification information about particular signature vari- 
ants. 
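
As a non-limiting illustration of the cluster structure recited in claims 12 and 13, the following Python sketch models the character sequence component (SigList, with RegionData) and the archetype component (ArchetypeList, with CaseData) described with reference to FIG. 3, together with the longest-region checksum test of step 608. The field names follow the description; the use of `zlib.crc32` as the 32-bit checksum, the helper names `make_region` and `longest_matching_region`, and the use of character rather than byte offsets are assumptions made for this sketch.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegionData:
    begin_offset: int   # offset of the region start from the signature start (often negative)
    region_length: int  # number of characters in the region
    crc: int            # checksum of the region's characters


@dataclass
class SigData:
    sig: str                                        # short pattern the scanner searches for
    region_list: List[RegionData] = field(default_factory=list)


@dataclass
class CaseData:
    send_id: str
    recv_id: str
    recv_time: float


@dataclass
class ArchetypeData:
    archetype_ptr: str  # reference to the stored full text
    hash_block: bytes
    case_list: List[CaseData] = field(default_factory=list)


@dataclass
class Cluster:
    cluster_id: int
    sig_list: List[SigData] = field(default_factory=list)
    archetype_list: List[ArchetypeData] = field(default_factory=list)


def make_region(text: str, sig_pos: int, begin_offset: int, length: int) -> RegionData:
    """Record one region of an archetype's invariant text around a signature."""
    start = sig_pos + begin_offset
    return RegionData(begin_offset, length, zlib.crc32(text[start:start + length].encode("utf-8")))


def longest_matching_region(text: str, sig_pos: int, sig_data: SigData) -> Optional[RegionData]:
    """Try regions from longest to shortest; return the first whose checksum matches."""
    for region in sorted(sig_data.region_list, key=lambda r: r.region_length, reverse=True):
        start = sig_pos + region.begin_offset
        end = start + region.region_length
        if start < 0 or end > len(text):
            continue  # region would fall outside the scanned message
        if zlib.crc32(text[start:end].encode("utf-8")) == region.crc:
            return region
    return None
```

Checking the longest region first means that an exact copy of an archetype is confirmed with a single checksum, while a variant that shares only the text near the signature still matches one of the shorter regions.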

14. A method of hindering an undesirable transmission or 
receipt of electronic messages within a network of users,
comprising the steps of: 

determining that transmission or receipt of at least one 
specific electronic message is undesirable;

automatically extracting detection data that permits detection
of the at least one specific electronic message or
variants thereof; 

scanning one or more inbound and/or outbound messages 
from at least one user for the presence of the at least one 
specific electronic message or variants thereof;

taking appropriate action, responsive to the scanning step; 
and 
storing the extracted detection data,

wherein the extracting step comprises the step of extract- 
ing a signature from the at least one specific electronic 
message; 

wherein the storing step comprises the step of storing the 
signature in at least one signature database; 

wherein the signature database comprises a plurality of 
signature clusters, each cluster including data corre- 
sponding to substantially similar electronic messages; 

wherein each of the signature clusters comprises a char- 
acter sequence component having scanning informa- 
tion and an archetype component having identification 
information about particular signature variants; and 

wherein the scanning information includes a search char- 
acter sequence for a particular electronic message and 
extended character sequence information for all the 
electronic messages represented in the cluster and 
wherein the identification information includes a 
pointer to a full text stored copy of an electronic 
message relating to a particular signature variant, a 
hashblock of the electronic message, and alert data 
corresponding to specific instances where a copy of the 
electronic message was received and the proliferation 
of which was reported as undesirable by an alert user. 
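The cluster layout recited in claim 14 can be pictured as a small set of record types. The following is a minimal sketch only: the class and attribute names are illustrative assumptions, and only the concepts (search character sequence, extended character sequence information, pointer, hashblock, alert data) come from the claim language. The `beginoffset`, `regionlength`, CRC and `receivetime` fields are borrowed from claims 24 and 26.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Alert:
    receivetime: float  # time at which a copy was originally received (claim 26)
    alert_user: str     # hypothetical field: the user who reported the copy

@dataclass
class Region:
    beginoffset: int    # start of the checksummed region (claim 24)
    regionlength: int   # length of the checksummed region
    crc: int            # stored checksum of that region

@dataclass
class Archetype:
    fulltext_pointer: str   # pointer to the full text stored copy
    hashblock: List[int]    # hashblock of the electronic message
    alerts: List[Alert] = field(default_factory=list)

@dataclass
class CharacterSequence:
    search_sequence: str    # search character sequence used when scanning
    extended: List[Region] = field(default_factory=list)

@dataclass
class SignatureCluster:
    char_seq: CharacterSequence
    archetypes: List[Archetype] = field(default_factory=list)
```

One cluster then holds a single character sequence component and one archetype entry per known variant.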

15. The method of claim 2 wherein the extracting step and 
the scanning step occur simultaneously and asynchronously 
across the network of users. 

16. The method of claim 4 further comprising the step of 
confirming, before the scanning step, the undesirability of 
the at least one specific electronic message. 

17. The method of claim 16 wherein the confirming step 
comprises the step of confirming, with a generic detection 
technique, the undesirability of the at least one specific 
electronic message. 

18. The method of claim 16 wherein the confirming step 
comprises the step of requiring that a predetermined thresh- 
old number of users signal that the at least one specific 
electronic message is undesirable. 
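The threshold test of claim 18 could be sketched as below. This is an assumption-laden illustration: the `AlertCounter` class, the message identifier and the rule that one user counts only once are hypothetical choices, not details stated in the claim.

```python
from collections import defaultdict

class AlertCounter:
    """Confirms undesirability once enough distinct users have signalled it."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._reporters = defaultdict(set)  # message id -> users who reported it

    def report(self, message_id, user):
        # Repeat reports from the same user are counted once (an assumption).
        self._reporters[message_id].add(user)
        return len(self._reporters[message_id]) >= self.threshold
```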

19. The method of claim 14 wherein the extracting step 
comprises the steps of: 

scanning the specific electronic message for any signa- 
tures in the at least one signature database; and 

comparing, responsive to finding a matching signature in 
the scanning step, the matching signature to each 
message variant in a matching cluster. 

20. The method of claim 19 wherein the comparing step 
comprises the steps of: 

computing a hashblock for the specific electronic mes- 
sage; and 

comparing the computed hashblock with variant hash- 
blocks in the identification information of each arche- 
type component. 
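The comparison of claim 20 might look like the following sketch. The claims in this section do not define how a hashblock is built, so the code assumes, purely for illustration, that it is a list of CRC32 values over fixed-size chunks of the message; the function names and the chunk size are likewise hypothetical.

```python
import zlib

def compute_hashblock(text, chunk=32):
    # Assumed definition: one CRC32 per fixed-size chunk of the encoded text.
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def find_exact_variant(hashblock, variant_hashblocks):
    # Return the index of the first variant with an identical hashblock.
    for index, candidate in enumerate(variant_hashblocks):
        if candidate == hashblock:
            return index
    return None
```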

21. The method of claim 20 further comprising the steps 
of: 

if an exact variant hashblock match is found, retrieving 
the full text stored copy of the variant match using the 
pointer, and 

if the full text stored copy of the variant match and the full 
text of the specific electronic message are deemed 
sufficiently similar to regard the specific electronic 
message as an instance of the variant, extracting alert 
data from the specific electronic message and adding it 
to the alert data for the variant match; 

else if an exact variant hashblock match is not found or 
the full text of the specific electronic message is found 
to be insufficiently similar to any of the variants in the
database, determining whether the specific electronic 
message is sufficiently similar to any existing cluster; 

if the specific electronic message is sufficiently similar to 
an existing cluster, computing new identification infor-
mation associated with the specific electronic message;

else if the specific electronic message is not determined to 
be sufficiently similar to an existing cluster, creating a 
new cluster for the specific electronic message. 
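The branching in claim 21 can be followed more easily as control flow. The sketch below is one possible reading, not the claimed implementation: the hashblock definition, the use of `difflib.SequenceMatcher` as the "sufficiently similar" test, the 0.9 ratio, the dictionary layout and the `similar_to_cluster` callback are all assumptions introduced for illustration.

```python
import difflib
import zlib

def hashblock(text, chunk=32):
    # Assumed definition: CRC32 of each fixed-size chunk of the message.
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def route_message(message, variants, fulltexts, similar_to_cluster, min_ratio=0.9):
    hb = hashblock(message)
    for variant in variants:
        if variant["hashblock"] != hb:
            continue
        # Exact hashblock match: retrieve the stored full text via the pointer.
        stored = fulltexts[variant["pointer"]]
        if difflib.SequenceMatcher(None, stored, message).ratio() >= min_ratio:
            # The message itself stands in for the extracted alert data here.
            variant["alerts"].append(message)
            return "alert_added"
    if similar_to_cluster(message):
        # Similar to an existing cluster: record new identification information.
        pointer = "msg%d" % len(fulltexts)
        fulltexts[pointer] = message
        variants.append({"hashblock": hb, "pointer": pointer, "alerts": [message]})
        return "variant_added"
    return "new_cluster"
```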

22. The method of claim 21 wherein the determining step 
comprises the steps of: 

computing a checksum of a region of the specific elec- 
tronic message indicated in the extended character 
sequence information for each cluster; and

comparing the computed checksum with a stored check- 
sum in the extended character sequence information of 
each cluster. 
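A region checksum comparison of the kind recited in claim 22 might be sketched as follows. The sketch assumes CRC32 as the checksum and assumes the offset is measured from the start of the encoded message; neither detail is fixed by the claim, and the function name is hypothetical.

```python
import zlib

def region_matches(message, beginoffset, regionlength, stored_crc):
    data = message.encode("utf-8")
    region = data[beginoffset:beginoffset + regionlength]
    if len(region) < regionlength:
        # The message is too short to contain the indicated region.
        return False
    return zlib.crc32(region) == stored_crc
```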

23. The method of claim 19 further comprising the step of 
creating, if no signature match is found, a new cluster for the
specific electronic message. 

24. The method of claim 22 wherein the extended char- 
acter sequence information includes a beginoffset field, a 
regionlength field and a CRC field, the method further 
comprising the steps of:

determining, for each cluster, a matching region with a 
longest regionlength; and 

identifying, if the longest regionlength among all the 
clusters is at least equal to a specified threshold length, 
a longest regionlength cluster as an archetype cluster to
which the specific electronic message archetype is to be 
added. 
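The selection rule of claim 24 can be illustrated with a short sketch. It assumes each cluster's extended character sequence information is available as a list of `(beginoffset, regionlength, crc)` tuples and that the CRC is CRC32; the function names and the dictionary layout are hypothetical.

```python
import zlib

def longest_matching_region(data, regions):
    # Longest regionlength whose stored CRC matches the same span of the message.
    best = 0
    for beginoffset, regionlength, crc in regions:
        span = data[beginoffset:beginoffset + regionlength]
        if len(span) == regionlength and zlib.crc32(span) == crc:
            best = max(best, regionlength)
    return best

def choose_archetype_cluster(data, clusters, threshold):
    # clusters: cluster id -> list of (beginoffset, regionlength, crc) tuples.
    best_id, best_len = None, 0
    for cluster_id, regions in clusters.items():
        length = longest_matching_region(data, regions)
        if length > best_len:
            best_id, best_len = cluster_id, length
    # Accept the winner only if it reaches the specified threshold length.
    return best_id if best_len >= threshold else None
```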

25. The method of claim 23 further comprising the step of 
recomputing the scanning information of the identified clus-
ter.

26. The method of claim 14 wherein the alert data 
includes a receivetime field having a time at which a copy 
was originally received and wherein the method further 
comprises the steps of: 

periodically comparing the receivetime field of all vari-
ants of each signature cluster with the current time; and

removing a signature cluster in which none of the 
receivetime fields are more recent than a predetermined 
date and time. 
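The periodic clean-up of claim 26 might be sketched as below. The sketch assumes receivetime values are plain numbers (for example seconds since an epoch) and that each cluster is reduced to the list of receivetime values of all its variants; both simplifications, and the function name, are illustrative.

```python
def prune_clusters(clusters, cutoff):
    """Keep only clusters with at least one receivetime more recent than cutoff.

    clusters: cluster id -> list of receivetime values for all of its variants.
    """
    return {
        cluster_id: times
        for cluster_id, times in clusters.items()
        if any(t > cutoff for t in times)
    }
```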

27. The method of claim 1 wherein the scanning step
comprises the steps of: 

extracting a message body; 

transforming the message body into an invariant form;

scanning the invariant form for exact or near matches to
the detection data; and

determining, for each match, a level of match.
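The scanning steps of claim 27 can be sketched in a few lines. Everything specific here is an assumption: the invariant form is taken to be lowercase text with all non-alphanumeric characters removed, and the level of match is taken to be the fraction of the signature covered by the longest common substring.

```python
import difflib
import re

def to_invariant(body):
    # Assumed invariant form: lowercase, letters and digits only.
    return re.sub(r"[^a-z0-9]", "", body.lower())

def match_level(body, signature):
    # 1.0 means the whole signature occurs in the body; lower values are near matches.
    inv = to_invariant(body)
    sig = to_invariant(signature)
    if not sig:
        return 0.0
    matcher = difflib.SequenceMatcher(None, inv, sig, autojunk=False)
    longest = matcher.find_longest_match(0, len(inv), 0, len(sig))
    return longest.size / len(sig)
```

Because punctuation and spacing are dropped first, a body such as `M-A-K-E  $$$ Fast!` and a signature such as `make fast` reduce to the same string before comparison.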

28. The method of claim 1 wherein the taking step 
comprises the step of taking appropriate action, upon dis-
covering the presence of the at least one specific electronic
message or variants thereof. 

29. The method of claim 28 wherein the taking step 
comprises the step of labeling the at least one specific 
electronic message or variants thereof as undesirable or 
confidential.

30. The method of claim 28 wherein the taking step 
comprises the step of removing the at least one specific 
electronic message or variants thereof.

31. The method of claim 27 wherein the taking step 
comprises the step of taking appropriate action for each
determined level of match, responsive to one or more user 
preferences. 
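Claim 31 ties the action taken to the level of match and to user preferences. A minimal sketch, assuming the level is a number between 0 and 1 and that preferences are expressed as `(minimum level, action)` pairs; the action names and the default of delivering the message are hypothetical.

```python
def choose_action(level, preferences, default="deliver"):
    # Try the strictest rule first and return the first one the level satisfies.
    for min_level, action in sorted(preferences, reverse=True):
        if level >= min_level:
            return action
    return default
```

For example, a user might prefer to label probable matches and remove near-certain ones:

```python
prefs = [(0.5, "label"), (0.95, "remove")]
choose_action(0.7, prefs)   # "label"
```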



32. A method of hindering an undesirable transmission or 
receipt of electronic messages within a network of users, 
comprising the steps of: 

determining that transmission or receipt of at least one 
specific electronic message is undesirable;

automatically extracting detection data that permits detec- 
tion of the at least one specific electronic message or 
variants thereof; 

scanning one or more inbound and/or outbound messages 
from at least one user for the presence of the at least one 
specific electronic message or variants thereof; and

taking appropriate action, responsive to the scanning
step;

wherein the scanning step comprises the steps of:

extracting a message body;

transforming the message body into an invariant form;

scanning the invariant form for exact or near matches
to the detection data; and

determining, for each match, a level of match, and

wherein the determining step comprises the steps of:

finding the longest regional matches for each match;

computing hashblock similarities between a hashblock
of the scanned message and hashblocks of each of
the extracted detection data;

receiving one or more user preferences; and

determining a level of match responsive to the finding,
computing and receiving steps.
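The determining step of claim 32 combines three inputs: the longest regional match, hashblock similarities, and user preferences. The sketch below shows one way these could be combined; the similarity measure, the preference keys `exact_similarity` and `min_region`, and the three level names are all assumptions made for illustration.

```python
def hashblock_similarity(a, b):
    # Assumed measure: fraction of positions holding the same hash value.
    if not a or not b:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))

def level_of_match(longest_region, message_hashblock, detection_hashblocks, prefs):
    best = max(
        (hashblock_similarity(message_hashblock, h) for h in detection_hashblocks),
        default=0.0,
    )
    if best >= prefs["exact_similarity"]:
        return "exact"
    if longest_region >= prefs["min_region"]:
        return "near"
    return "none"
```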

33. A program storage device, readable by a machine, 
tangibly embodying a program of instructions executable by 
the machine to perform method steps for hindering an 
undesirable transmission or receipt of electronic messages 
within a network of users, the method comprising the steps 
of: 

determining that transmission or receipt of at least one 
specific electronic message is undesirable; 

automatically extracting detection data that permits detec- 
tion of the at least one specific electronic message or 
variants thereof, wherein said automatically extracting 
includes automatically identifying and storing a text 
string signature contained within an undesirable elec- 
tronic message, said text string signature being statis- 
tically unlikely to be found in desirable electronic 
messages; 

scanning one or more inbound and outbound messages
from at least one user for the presence of the at least one
specific electronic message or variants thereof wherein
said scanning includes searching for said text string
signature within said inbound and outbound messages; and

taking appropriate action, responsive to the scanning step.
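Claim 33 requires a text string signature that is statistically unlikely to be found in desirable messages, without fixing the statistic in the claim itself. As a stand-in, the sketch below scores every fixed-length window of the undesirable message with character-trigram counts taken from a corpus of desirable messages and picks the least likely window. The trigram model, the add-one smoothing, the window length and the function names are assumptions, not the claimed technique.

```python
import math
from collections import Counter

def trigram_counts(corpus):
    counts = Counter()
    for text in corpus:
        for i in range(len(text) - 2):
            counts[text[i:i + 3]] += 1
    return counts

def pick_signature(undesirable, desirable_corpus, length=12):
    counts = trigram_counts(desirable_corpus)
    total = sum(counts.values()) + 1

    def score(window):
        # Sum of smoothed log-probabilities; lower means rarer in desirable mail.
        return sum(
            math.log((counts[window[i:i + 3]] + 1) / total)
            for i in range(len(window) - 2)
        )

    windows = [
        undesirable[i:i + length]
        for i in range(len(undesirable) - length + 1)
    ]
    # Fall back to the whole message when it is shorter than one window.
    return min(windows, key=score) if windows else undesirable
```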

34. A system for hindering an undesirable transmission or 
receipt of electronic messages within a network of users, 
comprising: 

means for determining that transmission or receipt of at 
least one specific electronic message is undesirable; 

means for automatically extracting detection data that 
permits detection of the at least one specific electronic 
message or variants thereof; 

means for scanning one or more inbound and/or outbound 
messages from at least one user for the presence of the 
at least one specific electronic message or variants
thereof; 

means for taking appropriate action, responsive to the 
scanning means, further comprising a means for storing 
the extracted detection data; and 
means for storing the extracted detection data;

wherein the extracting means comprise means for extract- 
ing a signature from the at least one specific electronic 
message; 

wherein the storing means comprise means for storing the
signature in at least one signature database; 

wherein the signature database comprises a plurality of 
signature clusters, each cluster including data corre-
sponding to substantially similar electronic messages;

wherein each of the signature clusters comprises a char- 
acter sequence component having scanning informa- 
tion and an archetype component having identification 
information about particular signature variants; and 

wherein the scanning information includes a search char-
acter sequence for a particular electronic message and
extended character sequence information for all the 
electronic messages represented in the cluster and 
wherein the identification information includes a 
pointer to a full text stored copy of an electronic
message relating to a particular signature variant, a 
hashblock of the electronic message, and alert data 
corresponding to specific instances where a copy of the 
electronic message was received and the proliferation 
of which was reported as undesirable by an alert user.

35. The system of claim 34 wherein the extracting means 
comprise: 

means for scanning the specific electronic message for 
any signatures in the at least one signature database; 
and 

means for comparing, responsive to finding a matching 
signature by the scanning means, the matching signa- 
ture to each message variant in a matching cluster. 

36. The system of claim 35 wherein the comparing means 
comprise: 

means for computing a hashblock for the specific elec- 
tronic message; and 

means for comparing the computed hashblock with vari-
ant hashblocks in the identification information of each
archetype component. 

37. The system of claim 36 further comprising:

means, if an exact variant hashblock match is found, for
retrieving the full text stored copy of the variant match
using the pointer,

means, if the full text stored copy of the variant match and 
the full text of the specific electronic message are 
deemed sufficiently similar to regard the specific elec-
tronic message as an instance of the variant, for extract-
ing alert data from the specific electronic message and
adding it to the alert data for the variant match; and 

means, else if an exact variant hashblock match is not 
found or the full text of the specific electronic message 
is found to be insufficiently similar to any of the 
variants in the database, for determining whether the 
specific electronic message is sufficiently similar to any 
existing cluster; 

means, if the specific electronic message is sufficiently 
similar to an existing cluster, for computing new iden-
tification information associated with the specific elec-
tronic message; and

means, else if the specific electronic message is not 
determined to be sufficiently similar to an existing 
cluster, for creating a new cluster for the specific
electronic message. 

38. The system of claim 37 wherein the determining 
means comprise: 

means for computing a checksum of a region of the 
specific electronic message indicated in the extended 
character sequence information for each cluster; and 

means for comparing the computed checksum with a 
stored checksum in the extended character sequence 
information of each cluster. 

39. The system of claim 35 further comprising means for 
creating, if no signature match is found, a new cluster for the 
specific electronic message. 

40. The system of claim 38 wherein the extended char-
acter sequence information includes a beginoffset field, a
regionlength field and a CRC field, the system further 
comprising: 

means for determining, for each cluster, a matching region 
with a longest regionlength; and 

means for identifying, if the longest regionlength among 
all the clusters is at least equal to a specified threshold 
length, a longest regionlength cluster as an archetype 
cluster to which the specific electronic message arche- 
type is to be added. 

41. The system of claim 39 further comprising means for 
recomputing the scanning information of the identified clus- 
ter. 

42. A system for hindering an undesirable transmission or 
receipt of electronic messages within a network of users, 
comprising: 

means for determining that transmission or receipt of at 
least one specific electronic message is undesirable; 

means for automatically extracting detection data that 
permits detection of the at least one specific electronic 
message or variants thereof; 

means for scanning one or more inbound and/or outbound 
messages from at least one user for the presence of the 
at least one specific electronic message or variants 
thereof; 

means for taking appropriate action, responsive to the
scanning means;

wherein the scanning means comprise:

means for extracting a message body;

means for transforming the message body into an
invariant form;

means for scanning the invariant form for exact or near
matches to the detection data; and

means for determining, for each match, a level of
match, and

wherein the determining means comprise:

means for finding the longest regional matches for each
match;

means for computing hashblock similarities between a
hashblock of the scanned message and hashblocks of
each of the extracted detection data;

means for receiving one or more user preferences; and

means for determining a level of match responsive to
the finding, computing and receiving steps.